Ordered message reception in a distributed data processing system

ABSTRACT

A complex computing system has a plurality of nodes interconnected by channels through which data messages are exchanged. The underlying principle is that after arrival at a node of a message, delivery of that message is delayed until after delivery and consequences of all more senior messages which affect the node. The messages are progressively timestamped at each node so that each time stamp contains generation by generation indicators of the origin of the associated message. The seniority of that message is uniquely determined thereby and total ordering of the messages can be achieved. When comparing timestamps for such ordering, comparison of respective generation indicators is necessary only until there is a distinction.

This invention relates to complex computing systems. It was developed primarily to answer a problem with distributed systems, but it has been realised that it is equally applicable to systems which are not normally considered to be distributed, such as a multi-processor computer. Although their physical separation may be negligible, the processors are nonetheless distinct and form a “distributed” system within the computer to which this invention is applicable.

A landmark paper on distributed systems is that of Lamport (“Time, Clocks and the Ordering of Events in a Distributed System”, Communications of the ACM Vol. 21 No. 7, 1978 pp 558-565). In that, a distributed system is defined as a collection of distinct processes which are spatially separated and which communicate with one another by exchanging messages, and in which the message transmission delay is not negligible compared to the time between events in a single process. In such a system, it is sometimes impossible to say that one of two events occurred first. Lamport proposed a logical clock to achieve a partial ordering of all the events, and he postulated a single integer timestamp on each message, corresponding to the time the message was sent.

Fidge (in “Logical Time in Distributed Computing Systems”, IEEE Computer 24(8) August 1991 pp 28-33) argued that the timestamps of Lamport clocks (totally ordered logical clocks) impose on unrelated concurrent events an arbitrary ordering, which the observer cannot distinguish from genuine causal relationships. He proposed partially ordered time readings and timestamping rules which enable a causal relationship between two events to be established. Their order could then be determined. But where there is no causal relationship between events, no definitive order exists, and different total orderings of events (or interleavings) are possible. This means that some messages are assigned an arbitrary order.

This ordering problem is known as the “race condition problem” and it can be illustrated by a simple analogy. A dictates a first message to secretary B, who faxes the typed version to C. A telephones C with a second message. Unless the messages are ordered, the communication system will not know whether the first or second message reached C first, although it will know that the dictation preceded the fax.

It is the aim of this invention to resolve this problem and to allow someone to programme a distributed system as if he was programming a uni-processor. In other words, he can think about time linearly and he will not have to be concerned about concurrency or the race condition problem.

According to one aspect of the present invention there is provided a complex computing system comprising a plurality of nodes connected to each other by channels along which timestamped data messages are sent and received, each timestamp being indicative, generation by generation, of its seniority acquired through its ancestors, arrival in the system and in any upstream nodes, and each node comprising: means for storing each input data message, means for determining the seniority of input data messages by progressive comparison of respective generations in the timestamps until the first distinction exists, means for delivering these messages for processing, means for applying a timestamp to each output message derived from such processing comprising the immediately ancestral message's timestamp augmented by a new generation seniority indicator consistent with the ordering, and means for outputting such ordered and timestamped messages.

The delivery means will generally be arranged to deliver messages in order according to which message has the most senior timestamp indicator.

For a data message received from outside the system the initial timestamp indicator will preferably include an indication of the time of receipt of said data message at the node, while for a data message generated by a node of the system the new generation seniority indicator of the timestamp will preferably include an indication of the place of said data message in the ordered sequence of such messages at said node. This indication may be real time or logical time.

Conveniently, monotonic integers are utilised as said generation seniority indicators in the timestamps.

Advantageously, the delivery means of a node delivers data messages only either once a message has been received on each of the input channels of said node or when at least one data message received on each of the input channels of said node is stored in the storage means.

Preferably each node will be adapted to perform at least one channel flushing routine triggerable by lack or paucity of channel traffic.

Ideally, all data messages caused by a first data message anywhere in the system will be delivered to a node before any messages caused by a second data message, junior to the first data message, are delivered to said node.

According to another aspect of the present invention there is provided a method of ordering data messages within a complex computing system comprising a plurality of nodes connected to each other by channels along which data messages are sent and received, the method comprising, for each node, timestamping each message on arrival, queuing messages until a message has been received on each input channel to the node, and delivering the queued messages for processing sequentially in accordance with their timestamps, the message having the most senior timestamp being delivered first, wherein the timestamping at each node is cumulative so that the timestamp of a particular message indicates the seniority acquired by that message, generation by generation, and wherein the seniority of one message against another is determined by the progressive comparison of respective generations in the timestamps until the first distinction exists.

According to a further aspect of the present invention there is provided a complex computing system comprising a plurality of nodes between which data messages are exchanged, wherein after the arrival at a node of a message, delivery of the message by the node is delayed until after the delivery and consequences of all more senior messages which affect the node.

Such a system may be either a distributed computing system, a symmetric multi-processor computer, or a massively parallel processor computer.

Assumptions

To understand later explanations, certain assumptions about a distributed computing system will be set out.

Such a system is a set of nodes or processes connected by FIFO (first in, first out) channels. Conventionally, ‘nodes’ refer to the hardware and ‘processes’ to the software and operations performed at the nodes, but the terms may be used interchangeably here. Some of these processes have external channels through which they communicate with the system's environment, the whole system being driven by input messages through some external channels, and sending out an arbitrary number of consequential output messages through other external channels.

Each process can be regarded as an application layer and a presentation layer, which handle the following events:

(a) Message arrival (at the presentation layer)

(b) Message delivery (from presentation to application layer)

(c) Message send request (from application to presentation layer)

(d) Message send (from presentation layer)

(e) Message processing complete (from application to presentation layer).

At any event (b), the application layer

i) generates one or more events (c)

ii) changes the process state, and

iii) generates event (e), which indicates that it is ready to receive a further message.

A set of such events will be termed a message handler invocation. Such invocations are the basic building blocks or atomic units of a distributed system, and a process history is a sequence of such invocations. Each invocation may affect subsequent invocations by changing the internal state of the process.

At the application layer, the channels are simplex. However, auxiliary messages, from one presentation layer to the other, are allowed in both directions.

There will be a global real-time clock, accessible from anywhere in the system. It is required only to be locally monotonic increasing, and there will be some bound on the difference between two simultaneous readings on the clock in different processes.

The FIFO channels are static.

Processes do not generate messages spontaneously.

Each message in the system has exactly one destination.

(These last three assumptions are working hypotheses which will be relaxed later).

Finally, for initial consideration, there are no loops in the possible dataflows. This will be discussed further below.

The Time Model

The aim is to achieve a total ordering of the set of messages in the system. If messages are delivered to every process in time order and messages are sent along every channel in time order then the system is said to obey “the time model”. A total ordering of the set of messages is equivalent to an injective mapping from the set of messages in the system to a totally ordered set (the time-line).

In this specification “<” will signify, in the relationship m_p<m_q, that message m_p precedes message m_q in the total order. This gives the first principle of the time model: there is a unique time for everything. Such a time is simply a label useful for evaluating the time order relationship between messages, and does not have any necessary relationship with real time.

The relation “<” is based on two partial order relations, “sent before” and “→” (strong causality), as explained below.

For any two external input messages, m₀ and m₁, either m₀ is sent before m₁ or m₁ is sent before m₀. This is given by the environment, typically by the clock time of message arrival. In other words, external input messages are totally ordered with respect to “sent before”.

If message send requests for messages m₀ and m₁ occur during the same invocation at the behest of some third message, the send request for m₀ being before that for m₁, then m₀ is sent before m₁.

“→” is the least partial order such that if the message send request for m₁ occurs during the invocation in response to m₀, then m₀→m₁ (i.e. a message strongly causes any messages sent by its handler).

The total order relation “<” is determined by the following axioms:

If m₀ is sent before m₁ then m₀<m₁.

If m₀→m₁ then m₀<m₁.

If m₀ is sent before m₁ and m₀→m′₀ then m′₀<m₁.

The first two axioms correspond to Lamport's axioms; the third, the strong causality axiom, is the heart of the time model of the present proposal.

The idea behind the strong causality axiom is the following: if a process or the system's environment sends two messages (m₀ and m₁), one after another, then any consequence (m′₀) of the first message (m₀) should happen before the second message (m₁) and any of its consequences.

This gives the second principle of the time model: there is enough time for everything (i.e. enough time for all remote consequences to happen before the next local event).

For a better understanding of the invention reference will now be made, by way of example, to the accompanying drawings, in which:

FIG. 1 is a diagram illustrating the total ordering of messages,

FIG. 2 is a diagram of a process with its time service,

FIG. 3 illustrates a message sequence of the time service,

FIG. 4 shows a network of processes, to explain channel flushing,

FIG. 5 is a diagram showing channel flushing messages and the structure of time service,

FIG. 6 illustrates a message sequence of channel flushing,

FIG. 7 shows a bus,

FIG. 8 shows a delay,

FIG. 9 shows a false loop, and

FIG. 10 comprises diagrams of feedback through a bus.

Referring to FIG. 1, the diagram can be likened to a tree on its side with its root (to the left) representing the system's environment which generates external messages. The nodes (the vertical lines) are message handler invocations and the arrowed horizontal lines represent messages.

Using this tree it is easy to reconstruct the message relations. For example, a→b because b was sent while a was handled; b is “sent before” c because the send request for b preceded that for c in the same invocation. Also, a→f and x is sent before z, as these relations are transitive. f and e are incomparable under both “sent before” and “→”; nevertheless f<e. To compare two messages with respect to the total order relation “<” one has to trace paths from the root to these messages. There can be three possible cases, which correspond to the three axioms. They are shown by the following three examples taken from FIG. 1:

c lies on the path of d, hence, c→d and therefore c<d.

x and z have the same path, but x is sent before z, and so x<z.

f and e have the same path prefix, but then their paths fork, and b (with b→f) is sent before c (with c→e), so f<c<e, giving f<e.

If a distributed system follows this time model, i.e. if messages are delivered to each process in this order, and sent down each channel in this order, then the system's behaviour will be deterministic, independent of the speed of processes and channels.

A possible time-line from which each message can be given a unique time is the set of sequences of integers. These are ordered using the standard dictionary ordering.

The path from the root to a message fully identifies the message position in the total order relation “<”. This path can be codified as a representation of time in the distributed system. The names of processes along the path are immaterial; the only information needed is the relative order of the message ancestors at each process and the order of the initial external messages. So, in FIG. 1, the time can be represented by an array of integers, e.g. [1,2,3,1,2] for e, [2,3] for z, [2,2,2,1,1] for y. However, since the system's environment may be distributed, it could be difficult to assign unique integers to each external message. A possible solution is to use real clock values combined with an external input identifier, which requires that at every external input all real clock readings are unique and grow monotonically.
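The dictionary ordering of integer sequences can be illustrated with a minimal sketch, assuming that a path-only timestamp is held in a std::vector. The names PathTime, Senior and CheckFig1Examples are hypothetical and are not part of the specification; the three sequences are the ones quoted above for e, z and y.

```cpp
#include <cassert>
#include <vector>

// Hypothetical alias: a logical time is the integer sequence read off the
// path from the root of the tree to the message.
using PathTime = std::vector<unsigned>;

// True when a is senior to (earlier than) b under dictionary ordering.
// Generations are compared in turn until the first distinction; if one
// sequence is a prefix of the other, the shorter one is senior.
// std::vector's operator< already implements this lexicographic rule.
inline bool Senior(const PathTime& a, const PathTime& b) {
    return a < b;
}

// Usage with the three timestamps quoted in the text for FIG. 1.
inline void CheckFig1Examples() {
    const PathTime e{1, 2, 3, 1, 2};
    const PathTime z{2, 3};
    const PathTime y{2, 2, 2, 1, 1};
    assert(Senior(e, y));   // first elements differ: 1 < 2
    assert(Senior(y, z));   // second elements differ: 2 < 3
    assert(Senior(e, z));   // first elements differ: 1 < 2
    assert(!Senior(z, y));  // the order is strict
}
```

Under this ordering the three example messages are totally ordered as e, then y, then z, and the comparison never needs to look beyond the first generation at which two timestamps differ.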

The following C++ class can be used for time:

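    #include <algorithm>  // std::min

    const unsigned MAX_PATH = 64;  // illustrative bound; the value is not given in the text

    class TTime {
    public:
      TTime():
        RealClock( 0.0 ),
        Input( 0 ),
        Length( 0 ) {}
      TTime( float realclock, unsigned input ):
        RealClock( realclock ),
        Input( input ),
        Length( 0 ) {}
      // Start a new generation when the message is delivered to a process.
      void AddNewProcess()
        { Path[ Length++ ] = 0; }
      // Advance the most recent generation after each output message.
      void operator++() { Path[ Length-1 ]++; }
      friend bool operator<( TTime t0, TTime t1 );
    private:
      float RealClock;
      unsigned Input;
      unsigned Path[ MAX_PATH ];
      unsigned Length;
    };

    // Seniority: real clock first, then input identifier, then the path,
    // generation by generation, until the first distinction.
    bool operator<( TTime t0, TTime t1 ) {
      if( t0.RealClock < t1.RealClock ) return true;
      if( t1.RealClock < t0.RealClock ) return false;
      if( t0.Input < t1.Input ) return true;
      if( t1.Input < t0.Input ) return false;
      for( unsigned i=0; i<std::min( t0.Length, t1.Length ); i++ ) {
        if( t0.Path[i] < t1.Path[i] ) return true;
        if( t1.Path[i] < t0.Path[i] ) return false;
      }
      return t0.Length < t1.Length;
    }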

The more processes handle a message, the longer its path grows. Potentially, if there is a cycle in the system, paths can become arbitrarily long.

The implementation should use (when possible) dynamic allocation to avoid the arbitrary upper limit on path length.
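One way to remove the fixed path bound is sketched below, assuming the path is stored in a std::vector so that it can grow as needed. The class name DynTime and its member names are hypothetical, and double is used for the real clock instead of float purely as an illustrative choice. The comparison keeps the same precedence as the class above: real clock, then input identifier, then the path in dictionary order.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical variant of the time class with a dynamically sized path.
class DynTime {
public:
    DynTime() = default;
    DynTime(double realclock, unsigned input)
        : real_clock_(realclock), input_(input) {}

    // A new generation is appended each time the message reaches a process.
    void AddNewProcess() { path_.push_back(0); }

    // Increment the latest generation; only valid once a generation exists.
    DynTime& operator++() {
        assert(!path_.empty());
        ++path_.back();
        return *this;
    }

    std::size_t Length() const { return path_.size(); }

    // Tuple comparison is lexicographic over (clock, input, path), and the
    // vector comparison is itself lexicographic with the shorter prefix first.
    friend bool operator<(const DynTime& a, const DynTime& b) {
        return std::tie(a.real_clock_, a.input_, a.path_) <
               std::tie(b.real_clock_, b.input_, b.path_);
    }

private:
    double real_clock_ = 0.0;
    unsigned input_ = 0;
    std::vector<unsigned> path_;
};
```

The cost of this choice is a heap allocation per timestamp copy, which may matter if timestamps are copied on every message; a small-buffer scheme would be one possible refinement.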

Each node or message handler invocation in the distributed system is structured as shown in FIG. 2. All functionality related to support for the time model resides in the time service, so that a process does not know anything about the time model. Each time it finishes handling a message it informs its time service (Done signal or event (e) above). The time service has a local clock T of type TTime which is updated whenever a message is received or sent by its process. Initially the local clock has a value given by a default constructor and it is kept and used even while the process is idle.

The “timestamp assigner” at the border with the system's environment has a real clock synchronized with the clocks of all other timestamp assigners, and a unique input identifier. ‘Synchronized’ here is understood to mean adequately synchronized, for example by means of the network time protocol as described in the Internet Engineering Task Force's Network Working Group's Request for Comments 1305 entitled ‘Network Time Protocol (Version 3) Specification, Implementation and Analysis’ by David L Mills of the Univ. of Delaware published by the IETF in March 1992. Each time an external message enters the system it gets a unique timestamp constructed from these two values (see the second constructor of TTime). It is assumed that the real clock progresses between each two messages.

Input messages are not delivered to the process until there are messages present on all inputs. Once this condition holds, the local clock is set to the timestamp of the most senior message, the new process is added to the path, and the most senior message is delivered. Every output message sent by the process while handling this input is timestamped by the time service with the current value of the local clock; and then the clock is incremented. The next message can be delivered only after the process explicitly notifies the time service that it has finished with the previous one, is idle and waiting (Done). The corresponding message sequence is shown in FIG. 3. The basic algorithm Alg. 1 of the time service is shown in the table below.

Initial state: Idle.

State Idle

  Event: Input message arrives
  Action:
    if( there are messages on all inputs )     // If all inputs are non-empty,
      DeliverTheOldestMessage();               // delivery is possible.

State Handling Message

  Event: Process sends output message
  Action:
    Send it with the timestamp T;
    T++;                                       // Increment the last time in the timestamp.

  Event: Done
  Action:
    if( there are messages on all inputs ) {   // If all inputs are still non-empty,
      DeliverTheOldestMessage();               // deliver the next oldest message.
      return;
    }
    Next state = Idle;

Functions

  void DeliverTheOldestMessage() {
    T = timestamp of the oldest message;       // First, the local clock is set to the value of the oldest
    T.AddNewProcess();                         // timestamp, and a new process is added to the path in this value.
    Deliver( the oldest message );
    Next state = Handling Message;
  }

This algorithm ensures that input messages are delivered to each process in the order of their timestamps, and that output messages are sent by each process in the order of their timestamps. Thus, the time service as described above fully implements the time model.
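The basic algorithm can be sketched in C++ as follows. This is a simplified illustration under stated assumptions, not the specification's implementation: timestamps are reduced to path-only integer sequences, the names Stamp, Msg and TimeService are hypothetical, and the caller is expected to call TryDeliver again after Done rather than having Done deliver the next message itself.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

using Stamp = std::vector<unsigned>;  // assumed path-only timestamp
struct Msg { Stamp time; int payload; };

class TimeService {
public:
    explicit TimeService(std::size_t inputs) : queues_(inputs) {}

    // Event: a message arrives on one of the FIFO input channels.
    void Arrive(std::size_t input, Msg m) {
        queues_.at(input).push_back(std::move(m));
    }

    // Delivery is possible only when idle and every input is non-empty.
    bool TryDeliver(Msg& out) {
        if (handling_ || queues_.empty()) return false;
        for (const auto& q : queues_)
            if (q.empty()) return false;
        std::size_t best = 0;  // input holding the most senior head message
        for (std::size_t i = 1; i < queues_.size(); ++i)
            if (queues_[i].front().time < queues_[best].front().time) best = i;
        out = std::move(queues_[best].front());
        queues_[best].pop_front();
        clock_ = out.time;    // T = timestamp of the oldest message
        clock_.push_back(0);  // T.AddNewProcess()
        handling_ = true;     // next state = Handling Message
        return true;
    }

    // Event: the process sends an output; stamp it with T, then T++.
    Msg Send(int payload) {
        assert(handling_);
        Msg out{clock_, payload};
        ++clock_.back();
        return out;
    }

    // Event: Done; the service becomes idle again.
    void Done() { handling_ = false; }

private:
    std::vector<std::deque<Msg>> queues_;
    Stamp clock_;
    bool handling_ = false;
};
```

With two inputs holding messages stamped [2] and [1], the message stamped [1] is delivered first, and the outputs sent while handling it carry the timestamps [1,0] and [1,1].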

However, a distributed system containing a cycle will not work, as all time services in the cycle will always be missing at least one input message. Also, rare messages, either on an external input or on an internal process-to-process connection, may significantly slow down the whole system.

Channel flushing can solve both these problems. Channel flushing is a mechanism for ensuring that a message can be accepted. The principle is to send auxiliary messages that enable the time service to prove that no message will arrive which is earlier than one awaiting delivery. Hence the waiting message can be delivered.

There are two kinds, namely ‘sender channel flushing’, in which the sending end initiates channel flushing when the channel has been left unused for too long, and ‘receiver channel flushing’, in which the receiving end initiates channel flushing when it has an outstanding message that has been awaiting delivery for too long.

Receiver channel flushing will be considered first, in conjunction with FIG. 4. For simplicity, timestamps and clocks are represented by single integers.

Suppose for a certain period of time the two lower inputs of the process C are empty while there is a message with timestamp 23 waiting on the upper input. The time service of C wants to deliver the message as soon as possible, but it cannot do so until it proves that those messages that will eventually arrive on the empty inputs will have greater timestamps.

To prove it, C sends a channel flushing request “May I accept a message of time 23?” to B and F, both of which have to forward this request deeply into the system, until either a positive or negative response can be given. In fact, to verify that C can accept the message with the timestamp 23 in FIG. 4, it is enough to ask only the processes shown, since all inputs to the diagram are at times later than 23.

The algorithm described below is a straightforward implementation of channel flushing. All channels in the system (which actually connect the time services of the processes) are bi-directional, since, besides the normal uni-directional messages, the channel flushing messages are sent along them in the reverse direction. These messages and the structure of the time service are shown in FIG. 5.

The general idea is that each time the time service discovers that there are input messages waiting while some inputs are empty, it sets a flush timer. On the timeout event it starts the channel flush. It sends flush requests to all empty inputs, creates a (local) request record with the list of these inputs, and then waits for responses. If positive responses come from all inputs, the oldest message is delivered. If a negative response comes on any input the flush is cleared and re-scheduled.

Requests from other time services are handled in the following way. First, the time service tries to reply using its local information (its local clock and the timestamps of waiting messages). If it is unable to do so, it creates a (remote) request record and forwards the request to all empty inputs. If all responses are positive, so is the one to the remote requester. Otherwise, the response is No.
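The attempt to reply from local information can be sketched as a single decision function. This is a hedged illustration: timestamps are single integers as in the FIG. 4 simplification, and the names Reply, AnswerFlushRequest and ExampleRequestAt23, together with the process identifiers 2 and 3 and the clock values, are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

enum class Reply { Yes, No, Forward };

// t          : time named in the flush request
// returnPath : processes the request has already passed through
// self       : identifier of this process
// localClock : this time service's local clock
// waiting    : timestamps of messages waiting on the non-empty inputs
inline Reply AnswerFlushRequest(unsigned t,
                                const std::vector<int>& returnPath,
                                int self,
                                unsigned localClock,
                                const std::vector<unsigned>& waiting) {
    const bool looped =
        std::find(returnPath.begin(), returnPath.end(), self) != returnPath.end();
    // The request has made a cycle, or the local clock is already ahead of t.
    if (looped || t < localClock) return Reply::Yes;
    // A message older than t is waiting here, so the answer is assumed No.
    for (unsigned w : waiting)
        if (w < t) return Reply::No;
    // Cannot decide locally: the request must be forwarded to the empty inputs.
    return Reply::Forward;
}

// Usage for a request about time 23.
inline void ExampleRequestAt23() {
    assert(AnswerFlushRequest(23, {3}, 2, 30, {}) == Reply::Yes);      // clock ahead
    assert(AnswerFlushRequest(23, {2, 3}, 2, 10, {}) == Reply::Yes);   // loop found
    assert(AnswerFlushRequest(23, {3}, 2, 10, {17}) == Reply::No);     // older message waits
    assert(AnswerFlushRequest(23, {3}, 2, 10, {40}) == Reply::Forward);
}
```

Only the Forward outcome requires a request record to be kept; the Yes and No outcomes are answered immediately along the return path.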

The algorithm Alg. 2 is presented in a table below. Again, the time service has two states: Idle and Handling Message. While in the latter state, the time service is only serving process output messages, whereas in the Idle state it does all channel flushing work for both itself and other processes. <t, Path, Inputs> represents a request record. [] is an empty path. New, Delete and Find are operations over an array of request records that maintain the local state of the receiver channel flush algorithm. P is the identifier of this process. T is the local clock.

Initial state: Idle; flush timer not set; no request records.

State Idle

  Event: Input message with time t₁ arrives at input i
  Action:
    for( all <t, Path, Inputs> : Path ≠ [] )   // First update remote requests.
      if( t < t₁ )                             // Input i is younger than the request's time.
        YesForRequest( <t, Path, Inputs>, i );
      else                                     // Input is older than the request's time.
        NoForRequest( <t, Path, Inputs> );
    if( there are messages on all inputs ) {   // Then, if all inputs are non-empty,
      CancelLocalFlushing();
      DeliverTheOldestMessage();               // delivery is possible.
      return;
    }
    if( the new message is the oldest one ) {  // If this message becomes the
      CancelLocalFlushing();                   // oldest one, a new local
      Set flush timer;                         // flushing must be scheduled.
      return;
    }
    if( Find( <t, [], Inputs> ) )              // Otherwise, if there is a local request waiting
      YesForRequest( <t, [], Inputs>, i );     // for this input, that means Yes.

  Event: Flush timeout
  Action:
    StartFlushing( timestamp of the oldest message, [] );
                                               // Empty return path indicates that the request is local.

  Event: <Your Next Time?, t, [P₀...Pₙ]> flush request
  Action:
    if( P is among [P₀...Pₙ] || t < T ) {      // If request has made a cycle - assume Yes.
      Send to Pₙ: <Yes, t, [P₀...Pₙ]>;         // Or, if local clock is already
      return;                                  // ahead of t, definitely Yes.
    }
    if( there is a message older than t on some input ) {
      Send to Pₙ: <No, t, [P₀...Pₙ]>;          // Then assume No.
      return;
    }
    // This process is not able to answer immediately and it starts flushing.
    StartFlushing( t, [P₀...Pₙ] );

  Event: <Yes, t, [P₀...Pₙ]> positive response on input i
  Action:
    if( Find( <t, [P₀...Pₙ], Inputs> ) )
      YesForRequest( <t, [P₀...Pₙ], Inputs>, i );

  Event: <No, t, [P₀...Pₙ]> negative response
  Action:
    if( Find( <t, [P₀...Pₙ], Inputs> ) )
      NoForRequest( <t, [P₀...Pₙ], Inputs> );

State Handling Message

  Event: Process sends output message
  Action:
    Send it with the timestamp T;

  Event: Done
  Action:
    if( there are messages on all inputs ) {   // If all inputs are still non-empty,
      DeliverTheOldestMessage();               // deliver the next oldest message.
      return;
    }
    if( there is a non-empty input )           // Otherwise, if there is a non-empty input,
      Set flush timer;                         // a new local flushing must be scheduled.
    Next state = Idle;

Functions

  void DeliverTheOldestMessage() {
    T = maximum( T, timestamp of the oldest message );
                                               // Before delivery, the local clock is set
    T.AddNewProcess();                         // to the value of the oldest timestamp, and
    Deliver( the oldest message );             // a new process is added to the path in this value.
    Next state = Handling Message;
  }

  void CancelLocalFlushing() {                 // Cancelling local flushing activity includes
    Cancel flush timer;                        // cancelling the flush timer (just in case it is set)
    if( Find( <t, [], Inputs> ) )              // and deletion of a local request record (if any).
      Delete( <t, [], Inputs> );
  }

  void StartFlushing( TTime t, TPath Path ) {  // Flushing upon local or remote request starts with
    for( all empty inputs )                    // sending requests to all empty inputs
      Send to input: <Your Next Time?, t, Path+P>;
                                               // (P is added to the return path)
    New( <t, Path, Set of empty inputs> );     // and creation of a new request record.
  }

  void YesForRequest( TRecord <t, Path, Inputs>, TInput i ) {
                                               // On positive response on input i.
    if( i ∉ Inputs )                           // If the record has already received this
      return;                                  // information then it doesn't care.
    Inputs = Inputs \ i;                       // The input is removed from the Inputs set of the request record.
    if( Inputs == ∅ ) {                        // If this set becomes empty, no more responses are needed.
      Delete( <t, Path, Inputs> );
      if( Path == [] )                         // and if it is a local request
        DeliverTheOldestMessage();             // the oldest message is delivered.
                                               // Successful channel flushing.
      else                                     // Otherwise it is a remote request.
        Send to the last process in the Path: <Yes, t, Path>;
                                               // Yes is sent to the next process
    }                                          // in the return path.
  }

  void NoForRequest( TRecord <t, Path, Inputs> ) {
                                               // Negative response - acted on immediately.
    Delete( <t, Path, Inputs> );               // The corresponding record is deleted.
    if( Path == [] )                           // If it is a local request
                                               // (unsuccessful channel flushing)
      Set flush timer;                         // then restart the flush timer.
    else                                       // Otherwise, for a remote request,
      Send to the last process in the Path: <No, t, Path>;
                                               // No is sent to the next process
  }                                            // in the return path.

To illustrate the work of the algorithm, consider the example in FIG. 4 in conjunction with one possible channel flushing message sequence as shown in FIG. 6. “?” denotes here “Your Next Time?”, “Y” stands for Yes, and “N” for No. Sets near the vertical axes represent the sets of processes from which responses are still wanted (Inputs in the above terminology). “ok” means that a process is sure it will not be sending anything older than the timestamp of the request (i.e. 23). “loop” means that a process has found itself in the return path of the request.

It is evident that the channel flushing procedure consists of two waves: a fanning out request wave and a fanning in response wave. The request wave turns back and becomes the response wave as soon as the information needed for the initial request is found.

The “timestamp assigner” at the border with the environment treats the channel flushing request in the following way.

RealClock() returns the real clock reading; Input is the unique external input identifier.

  Event: External input message arrives
  Action:
    Send it with the timestamp TTime( RealClock(), Input );

  Event: <Your Next Time?, t, [P₀...Pₙ]> flush request
  Action:
    if( t < TTime( RealClock(), Input ) )      // If t is older than local time
      Send <Yes, t, [P₀...Pₙ]>;                // then definitely Yes.
    else
      Send <No, t, [P₀...Pₙ]>;                 // Otherwise assume No.

Sender channel flushing is conceptually simpler, and significantly more efficient than receiver channel flushing, although it does not provide a solution to the loop problem.

In sender channel flushing, each output channel of each process has a timeout associated with it. This timeout is reset each time a message is sent down the channel. The timeout can be either a logical timeout (i.e. triggered by some incoming message with a sufficiently later timestamp) or a physical timeout. If the timeout expires before being reset then a sender channel flush is initiated down that channel. The channel flush consists of a ‘non-message’ which is sent down the output channel. The receiver can use it to advance the time of the channel by allowing the time service to accept earlier messages waiting on other channels. When the non-message is the next message to be accepted, then the time service simply discards it. However, the non-message, by advancing the receiver's local clock, can cause logical timeouts on output channels of the receiver; hence causing a cascading sender channel flush.

The timestamp assigners also participate in sender channel flush; they have to use a physical timeout. In general, using both sender and receiver channel flushes is recommended; preferably with some sender channel flushes piggy backed upon receiver channel flush response messages.

To provide usable middleware implementing the time model it is necessary to relax some of the more restrictive assumptions about the system being built. Three special processes that need to be created and integrated with such middleware are now considered, as is a full treatment of loops in the dataflow.

The Bus

The bus is a process that allows multiple output ports from any number of processes to be connected to multiple input ports on other processes. The term ‘bus’ is taken from Harrison (A Novel Approach to Event Correlation, Hewlett-Packard Laboratories Report No. HPL-94-68 Bristol UK 1994) and is intended to convey the multiple access feature of a hardware bus.

The bus implements a multicast as a sequence of unicasts, its operation being shown in FIG. 7.

The output channels are ordered (shown in the diagram as 1, 2, 3, 4). When a message is delivered by the time service to any of the input channels, the bus outputs an identical message on each of its output channels in order. The time service computes the timestamp for these in the normal way, as shown.

A bus acts as a message sequencer, ensuring that all recipients of a series of multicasts receive the messages in the same order (as shown).
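How the bus's outputs would be stamped "in the normal way" can be sketched as follows, assuming path-only timestamps held in a std::vector. The names Stamp, Outgoing and BusMulticast are hypothetical, and the values are illustrative rather than those drawn in FIG. 7.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Stamp = std::vector<unsigned>;  // assumed path-only timestamp

struct Outgoing {
    std::size_t channel;  // output channel number, counted from 1
    Stamp time;           // timestamp carried by this copy
};

// Multicast as a sequence of unicasts: on delivery the local clock takes the
// delivered timestamp with a new generation appended; each copy is stamped
// with the clock, which is then incremented before the next copy is sent.
inline std::vector<Outgoing> BusMulticast(const Stamp& delivered,
                                          std::size_t outputs) {
    assert(outputs > 0);
    Stamp clock = delivered;
    clock.push_back(0);
    std::vector<Outgoing> out;
    for (std::size_t ch = 1; ch <= outputs; ++ch) {
        out.push_back(Outgoing{ch, clock});
        ++clock.back();
    }
    return out;
}
```

For a delivered timestamp [5] and four outputs, the copies carry [5,0], [5,1], [5,2] and [5,3]. Because all copies of one multicast share the prefix of the delivered message, every copy of a more senior multicast precedes every copy of a more junior one.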

The Delay

In a non-distributed system it may be possible to set a timer, and then have an event handler that is invoked when the timer runs out. This alarm can be seen as a spontaneous event. Within the time model, it must be ensured that spontaneous events have a unique time. The simplest way of achieving this is to treat spontaneous events just like external events. A timestamp is allocated to them using a real clock and a unique input identifier. Moreover, a process can schedule itself a spontaneous event at some future time which again will get a timestamp with real part coming from the scheduled time. Having thus enabled the scheduling of future events the delay component can be created as schematically shown in FIG. 8.

For each input message the delay generates an output message at some constant amount of time, δ, later. The time of the generated message is given by the sum of δ and the first time in the timestamp (the real time part). The rest of the path part of the timestamp is ignored. The input identifier part of the timestamp is changed from the original, 1, to the input identifier of the delay, 1′. There are large efficiency gains from fully integrating delays with the receiver and sender channel flush algorithms. The responses to flush requests should take the length of the delay into account, as should any relaying of flush requests through the delay.
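The timestamp rule for the delay can be sketched as below, under the assumption that a timestamp is a triple of real clock, input identifier and path. The struct Stamp and the function DelayStamp are hypothetical names, and the non-negativity check on δ is an added assumption.

```cpp
#include <cassert>
#include <vector>

// Assumed timestamp layout: real-time part, input identifier, path.
struct Stamp {
    double real;
    unsigned input;
    std::vector<unsigned> path;
};

// Timestamp of the message generated by a delay of length delta:
//  - the real-time part is the input's real-time part plus delta,
//  - the input identifier becomes that of the delay itself,
//  - the path of the input message is ignored, so the output starts afresh.
inline Stamp DelayStamp(const Stamp& in, double delta, unsigned delayInput) {
    assert(delta >= 0.0);
    return Stamp{in.real + delta, delayInput, {}};
}
```

For an input stamped (10.0, 1, [0,2]) and δ = 2.5 on a delay whose identifier is 9, the output is stamped (12.5, 9, []), so it is treated exactly like an external event arriving at time 12.5.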

The Plumber

The plumber (the topology manager) is a special process that manages the creation and destruction of channels and processes. The plumber has two connections with every process in the system. The first is for the process to make topology change requests to the plumber; the second is for the plumber to notify the process of topology changes that affect it (i.e. new channels created or old channels deleted). The plumber can create and delete processes that have no channels attached. The plumber has a specific minimum delay between receiving a topology change request and carrying it out. This is the key to a feasible solution to topology changes within this time model. The reason that topology changes are difficult for (pessimistic implementations of) the proposed time model is that for a process to be able to accept a message it must know that it is the oldest message that will arrive. If the topology is unknown then all other processes within the application must be asked if they might send an older message. This is implausible. The plumber acts as the single point one needs to ask about topology changes. Moreover, the minimum delay between the request for a topology change and its realisation ensures that the plumber does not need to ask backward to all other processes. For large systems, or for fault tolerance, multiple plumbers are needed, and these can be arranged hierarchically or in a peer-to-peer fashion. As with the delay process, the plumber needs to be integrated with the channel flush algorithms.
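The role of the minimum delay can be illustrated by a small sketch, which is not the plumber of this specification. The `Plumber` class, its method names, and the use of plain integers for time are hypothetical. Notification channels and flush integration are omitted.

```python
import heapq
from typing import List, Tuple


class Plumber:
    """Hypothetical topology manager with a fixed minimum request-to-effect delay."""

    def __init__(self, min_delay: int) -> None:
        self.min_delay = min_delay
        self._pending: List[Tuple[int, int, str]] = []  # (effective_time, seq, change)
        self._seq = 0  # tie-breaker so equal times keep request order

    def request(self, request_time: int, change: str) -> int:
        """Record a change request and return the time at which it takes effect."""
        effective = request_time + self.min_delay
        heapq.heappush(self._pending, (effective, self._seq, change))
        self._seq += 1
        return effective

    def safe_horizon(self, now: int) -> int:
        """Earliest time at which the topology could still change, as seen at `now`."""
        unrequested = now + self.min_delay  # bound for changes not yet requested
        if self._pending:
            return min(self._pending[0][0], unrequested)
        return unrequested

    def due(self, now: int) -> List[str]:
        """Pop and return the changes whose effective time has been reached."""
        changes = []
        while self._pending and self._pending[0][0] <= now:
            changes.append(heapq.heappop(self._pending)[2])
        return changes


plumber = Plumber(min_delay=10)
effective = plumber.request(100, "create channel A->B")  # hypothetical change
horizon = plumber.safe_horizon(105)
```

In this sketch a process that asks the plumber at time 105 learns that no topology change can take effect before time 110, so the plumber alone bounds how old a message from a newly created channel could be.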

Loops

Loops generate issues for the time model and typical loops generate a need for many auxiliary messages. A loop without a delay can only permit the processing of a single message at any one time anywhere within the loop (it is said that the loop “locksteps”). Three solutions to these problems are examined.

Removing Loops

The traditional design models, client/server and master/slave, encourage a control driven view of a distributed system, which leads to loops. A more data driven view of a system, like the data flow diagrams encouraged in structured analysis, is typically less loopy.

Moreover, where a first cut has loops, a more detailed analysis of a distributed system may show that these loops are spurious. The data-flows, rather than feeding into one another, feed from one submodule to another and then out. For example, in FIG. 9, there is an apparent loop between process A and process B (a flow from A feeds into B which feeds back into A). But when one looks at the sub-processes A1, A2, B1, B2 there are, in fact, no loops, only flows.

Co-locate the Processes in a Loop

If there is a loop for which other solutions are not appropriate, it will be found that only one process within the loop can be operational at any one time. It will normally be better to accept this, and put all the processes in the loop on the same processor. There will be no penalty in terms of loss of parallelism. This approach will minimise the cost of the auxiliary messages, because they will now be local messages.

Break the Loop Using a Delay

Informally, the problem with a loop is feedback. Feedback happens when an input message to a process strongly causes another message (the feedback) to arrive later at the same process. Under the strong causality axiom, feedback is strongly caused by the original messages, and hence comes before all subsequent messages. Hence any process in a loop must, after processing every message, first ascertain whether there is any feedback, before proceeding to deal with any other input. A delay process is a restricted relaxation of strong causality, since each input to the delay does not strongly cause the output, but rather schedules the output to happen later. Hence, if there is a delay within the loop, then a process can know that any feedback will not arrive until after the duration of the delay. Hence it can accept other messages arriving before the feedback.

A Difficult Case

The example in FIG. 10 presents specific problems of both semantics and implementation for feedback.

In each of the four cases we see a message with data α arriving at a bus B and being multicast to processes A and C. A responds to the message α by outputting a message with data β which is fed back into the bus, and hence multicast to A and C. When it arrives at A no further feedback is produced. If the bus sends to C before A (FIG. 10a) then no issues arise: the original message is multicast to both parties, and then the feedback happens and is multicast.

If, on the other hand, the bus sends to A before C then the feedback happens before the original message is sent to C. The order in which C sees the feedback and the original message is reversed (FIG. 10b). This indicates that strong causality and feedback require re-entrant processing similar to recursive function calls. Such re-entrant processing breaks the atomicity of invocations and also needs significantly more channel flushing messages than the non-re-entrant algorithm that has been presented. The simplest form of the algorithm, Alg. 1, would incorrectly output the later message from B to C before the earlier message (FIG. 10c). Without re-entrant processing there is a conflict between strong causality and the sent after relation. The later version of the algorithm, Alg. 2, refines the DeliverTheoldestMessage function to ensure that all incoming messages are delayed (with the minimum necessary delay) until after the previous output (FIG. 10d). This implementation obeys the time model, but (silently) prohibits non-delayed feedbacks. At the theoretical level this obviates the necessity for re-entrancy and prefers the sent after relation to strong causality. At the engineering level, this can be seen as a compromise between the ideal of the time model and channel flushing costs.

A slightly more exhaustive account of the above is given in the priority documents accompanying this Application.

What is claimed is:
 1. A complex computing system comprising: a plurality of nodes connected to each other by channels along which time stamped data messages are sent and received, each timestamp being indicative, generation by generation, of its seniority acquired through its ancestors' arrival in the system and in any upstream nodes, and each node including: means for storing each input data message, means for determining the seniority of input data messages by progressive comparison of respective generations in the timestamps until a first distinction exists, means for delivering the input data messages for processing, means for applying a timestamp to each output message derived from such processing comprising an immediately ancestral message's timestamp augmented by a new generation seniority indicator consistent with the ordering, and means for outputting such ordered and timestamped messages.
 2. A complex computing system as claimed in claim 1, wherein the means for delivering is arranged to deliver messages in order according to which message has the most senior timestamp indicator.
 3. A complex computing system as claimed in claim 1, wherein for a data message received from outside the system, an initial timestamp indicator includes an indication of a time of receipt of said data message at the node.
 4. A complex computing system as claimed in claim 1, wherein for a data message generated by a node of the system, a new generation seniority indicator of the timestamp includes an indication of the place of said data message in the ordered sequence of such messages at said node.
 5. A complex computing system as claimed in claim 1, wherein monotonic integers are utilized as said generation seniority indicators in the timestamps.
 6. A complex computing system as claimed in claim 1, wherein the means for delivering of a node delivers data messages only once a message has been received on each input channel of said node.
 7. A complex computing system as claimed in claim 1, wherein the means for delivering of a node delivers data messages only when at least one data message received on each input channel of said node is stored in the storage means.
 8. A complex computing system as claimed in claim 6, wherein each node is adapted to perform at least one channel flushing routine triggerable by lack or paucity of channel traffic.
 9. A complex computing system as claimed in claim 1, wherein all data messages caused by a first data message anywhere in the system are delivered to a node before any messages caused by a second data message, junior to the first data message, are delivered to said node.
 10. A method of ordering data messages within a complex computing system comprising a plurality of nodes connected to each other by channels along which data messages are sent and received, the method comprising, for each node, timestamping each message on arrival, queuing messages until a message has been received on each input channel to the node, and delivering the queued messages for processing sequentially in accordance with their timestamps, the message having a most senior timestamp being delivered first, wherein timestamping at each node is cumulative so that a timestamp of a particular message indicates a seniority acquired by that message, generation by generation, and wherein the seniority of one message against another is determined by a progressive comparison of respective generations in the timestamps until a first distinction exists.
 11. A complex computing system as claimed in claim 1, wherein the system is either a distributed computing system, a symmetric multi-processor computer, or a massively parallel processor computer. 